Abstract

This paper and its companion form an extended version of notes provided to 
participants in the Valencia September 2004 summer school on Data Analysis 
in Cosmology. The papers offer a pedagogical introduction to the problem 
of estimating the power spectrum from galaxy surveys. The intention is to 
focus on concepts rather than on technical detail, but enough mathematics is 
provided to point the student in the right direction. 

This first paper presents background material. It collects some essential 
definitions, discusses traditional methods for measuring power, notably the 
Feldman-Kaiser-Peacock (1994) [2] method, and introduces Bayesian analysis,
Fisher matrices, and maximum likelihood. For pedagogy and brevity, several 
derivations are set as exercises for the reader. At the summer school, multiple 
choice questions, included herein, were used to convey some didactic ideas, 
and provoked a little lively debate. 


1 Introduction 

It was a flawlessly organised September summer school in the historic
Mediterranean city of Valencia, whose narrow, marble-paved streets are so randomly
variable that you got lost in them as easily as in one of the lectures on “Data
Analysis in Cosmology” going on at the Palacio Pineda.

The lecture on power estimation was one of the first lectures at the summer 
school, and it seemed sensible to make available to the students in advance a 
reference set of notes containing essential definitions and background material 
that would prove useful throughout the summer school. The present paper is 
a somewhat extended version of those notes. The background material in the 
notes was not presented at the lecture, but rather was left as homework for the 
student during the long hours of siesta. To facilitate self-study, several of the 
derivations are posed as exercises for the reader. Solutions are not included, 
but the derivations contain enough guidance that the persistent student should
be able to solve them. 

The power spectrum is the most important statistic that can be measured 
from large scale structure (LSS). During the lecture, the reasons for this being 
so were conveyed through the device of multiple choice questions, which are 
included in this paper, along with answers at the end of the paper. 

This paper is arranged as follows. Section 2 collects some essential
definitions of correlation functions, power spectra, and shot noise. Section 3
discusses traditional methods for measuring power, notably the
Feldman-Kaiser-Peacock (1994) [2] method. Section 4 introduces Bayesian analysis,
Fisher matrices, and maximum likelihood.

A separate companion paper focusses on the actual designated topic of 
the lecture. It covers the practical issues of measuring power spectra from 
observations, with an emphasis on using maximum likelihood techniques to 
measure power at large, linear scales. 


2 Definitions 

This section collects definitions of some of the jargon that you will encounter 
not only in this lecture but repeatedly throughout this summer school. It is a 
good idea to assimilate the jargon^1.

2.1 Correlation Function 

Let $n(\mathbf{r})$ denote the observed number density of particles (galaxies) at
position $\mathbf{r}$ in a survey.

Let $\bar{n}(\mathbf{r})$ denote the selection function, the expected mean number of
particles (galaxies) at position $\mathbf{r}$ given the selection criteria of the survey.
Often but not always, the selection function is separable into a product of an
angular selection function $\bar{n}(\hat{\mathbf{r}})$ and a radial selection function $\bar{n}(r)$.
The determination or measurement of the angular and radial selection functions
of a survey is a non-trivial enterprise which is an essential prerequisite
for measuring correlation functions or power spectra.

The overdensity $\delta(\mathbf{r})$ of particles (galaxies) at position $\mathbf{r}$ is defined by

\[ \delta(\mathbf{r}) \equiv \frac{n(\mathbf{r}) - \bar{n}(\mathbf{r})}{\bar{n}(\mathbf{r})} . \tag{1} \]
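As a concrete illustration of the overdensity definition, here is a minimal Python sketch, assuming the observed density and the selection function have already been binned onto the same grid of cells. The function name `overdensity` and the sample numbers are hypothetical, and cells outside the survey (where the selection function vanishes) are simply flagged as undefined.

```python
import numpy as np

def overdensity(n_obs, n_bar):
    """Return delta = (n - nbar) / nbar; cells with nbar == 0 are set to NaN."""
    n_obs = np.asarray(n_obs, dtype=float)
    n_bar = np.asarray(n_bar, dtype=float)
    delta = np.full_like(n_bar, np.nan)
    inside = n_bar > 0                      # cells actually covered by the survey
    delta[inside] = (n_obs[inside] - n_bar[inside]) / n_bar[inside]
    return delta

counts = np.array([12.0, 8.0, 10.0, 0.0])     # hypothetical observed densities n(r)
expected = np.array([10.0, 10.0, 10.0, 0.0])  # hypothetical selection function nbar(r)
delta = overdensity(counts, expected)
```

The last cell lies outside the survey, so its overdensity is left undefined rather than set to zero.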


^1 I’ve added some optional footnotes, like this, on Hilbert space. A Hilbert space
is an infinite dimensional vector space equipped with an inner product. Hilbert space 
provides a compact, powerful, and unifying mathematical formalism, just as ordinary 
vectors do in finite-dimensional geometry. A density field is a vector in a Hilbert 
space; a covariance function is a matrix in Hilbert space; an n-point correlation 
function is a rank-n tensor in Hilbert space. 
The correlation function $\xi(r_{ij})$ (or 2-point correlation function) is the
covariance of overdensities at separation $r_{ij} \equiv |\mathbf{r}_i - \mathbf{r}_j|$

\[ \xi(r_{ij}) \equiv \langle \delta(\mathbf{r}_i) \delta(\mathbf{r}_j) \rangle . \tag{2} \]

In large scale structure (LSS), the correlation function is often, though
not always, conventionally taken to refer to the covariance function in real
space (as opposed to Fourier space or some other space). The assumption
that the Universe is statistically homogeneous (= statistically translation
invariant) means that the correlation function is a function only of the vector
separation $\mathbf{r}_{ij} \equiv \mathbf{r}_i - \mathbf{r}_j$ of two points. The assumption that the Universe is
statistically isotropic (= statistically rotation invariant) means that the
correlation function is a function only of the magnitude of the separation
$r_{ij} \equiv |\mathbf{r}_{ij}|$ of two points.

2.2 Power Spectrum 

A Fourier mode $\delta(\mathbf{k})$ is the Fourier transform of the overdensity^2

\[ \delta(\mathbf{k}) \equiv \int \delta(\mathbf{r}) \, e^{i \mathbf{k} \cdot \mathbf{r}} \, d^3r , \qquad \delta(\mathbf{r}) = \int \delta(\mathbf{k}) \, e^{-i \mathbf{k} \cdot \mathbf{r}} \, \frac{d^3k}{(2\pi)^3} . \tag{5} \]

^2 You may not be familiar with the practice of using the same symbol $\delta$ for both
real and Fourier space; but $\delta$ is the same vector in Hilbert space, with components
$\delta_r$ in real space, or $\delta_k$ in Fourier space. The essential property of Hilbert space is
the existence of an inner product (= scalar product). In the present case, the inner
product of two real-valued vectors $a_r$ and $b_r$ is defined to be

\[ a \cdot b \equiv a^r b_r \equiv \int a(\mathbf{r}) \, b(\mathbf{r}) \, d^3r . \tag{3} \]

In the index notation $a^r b_r$, repeated indices imply implicit summation (which
becomes integration over the infinite dimensional space of positions), just as in
relativity and quantum mechanics. In repeated pairs of indices, one index is always
raised, while the other is always lowered (though it is also common, for notational
simplicity, to keep all indices lowered, which causes no ambiguity as long as it is
implicitly understood that in contracting over paired indices, one index is always
raised and the other lowered). In real space, the raised components of a real-valued
vector are numerically equal to the lowered components, $a^r = a_r$, but this is a
special feature of real space, and is not true in Fourier space, spherical harmonic
space, or other spaces.

Exercise 1. Show from equation (3) and the definition $a(\mathbf{k}) \equiv \int a(\mathbf{r}) \, e^{i \mathbf{k} \cdot \mathbf{r}} \, d^3r$ of
the Fourier transform that the inner product of vectors $a_k$ and $b_k$ in Fourier space
is

\[ a \cdot b = a^k b_k = \int a(\mathbf{k})^* \, b(\mathbf{k}) \, \frac{d^3k}{(2\pi)^3} \tag{4} \]

which is called Parseval’s theorem. Once you’ve set up the formalism, you can
deduce by inspection that $a^k b_k = a^r b_r$, since the inner product is by construction
a scalar, independent of the representation of the vectors. Notice that in Fourier
space, the raised components of a vector are equal to the complex conjugate of its
lowered components, $a^k = (a_k)^* = a_{-k}$. □

The allocation of factors of $2\pi$ here follows the standard convention in
cosmology, which you would be wise to stick to even if you don’t like it.
Other disciplines have their own conventions.

The power spectrum $P(k)$ is the Fourier transform of the correlation
function

\[ P(k) \equiv \int \xi(r) \, e^{i \mathbf{k} \cdot \mathbf{r}} \, d^3r , \qquad \xi(r) = \int P(k) \, e^{-i \mathbf{k} \cdot \mathbf{r}} \, \frac{d^3k}{(2\pi)^3} . \tag{6} \]

Exercise 2. From the definitions (2) of the correlation function and (5) of
Fourier modes, and the relation (6) between the power spectrum and the
correlation function, show that the covariance of Fourier modes is^3

\[ \langle \delta(\mathbf{k}_i) \delta(\mathbf{k}_j) \rangle = (2\pi)^3 \delta_D(\mathbf{k}_i + \mathbf{k}_j) \, P(k_i) \tag{7} \]

where $\delta_D$ denotes the (here 3-dimensional) Dirac delta-function. Show that the
delta-function arises from the assumption of statistical translation invariance.
Show that the fact that the power spectrum $P(k_i)$ is a function of the
magnitude $k_i \equiv |\mathbf{k}_i|$ of its argument follows from the assumption of statistical
rotation invariance. □
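A discrete, one-dimensional analogue of equation (7) can be checked numerically. The sketch below is illustrative only: it assumes a periodic grid and an arbitrary Gaussian-shaped correlation function (both are assumptions, not taken from the text), builds the translation-invariant real-space covariance matrix, and transforms it to Fourier space. NumPy's FFT uses the opposite sign convention in the exponent from equation (5), and the second mode is complex conjugated here, so the delta-function in $\mathbf{k}_i + \mathbf{k}_j$ shows up as a diagonal matrix.

```python
import numpy as np

n = 64                                             # number of periodic grid points
sep = np.minimum(np.arange(n), n - np.arange(n))   # periodic separation |r_i - r_j|
xi = np.exp(-0.5 * (sep / 3.0) ** 2)               # assumed correlation function xi(r)

# Real-space covariance C_ij = xi(|r_i - r_j|): depends on separation only.
idx = np.arange(n)
cov_real = xi[(idx[:, None] - idx[None, :]) % n]

# Discrete power spectrum: the Fourier transform of xi.
power = np.fft.fft(xi).real

# Covariance of Fourier modes <delta_k delta_k'^*> = F C F^dagger / n.
F = np.fft.fft(np.eye(n))                          # DFT matrix
cov_fourier = F @ cov_real @ F.conj().T / n
off_diag = cov_fourier - np.diag(np.diag(cov_fourier))
```

The off-diagonal elements vanish to rounding error, and the diagonal reproduces the transform of the correlation function, which is the content of equation (7) in this toy setting.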

2.3 2-Point Function 

The correlation function and the power spectrum are both representations,
expressed respectively in real space and Fourier space, of the covariance
function, also known as the 2-point function.

The 2-point function is the 2nd member of an infinite sequence of n-point 
functions, which are proportional to the n’th order irreducible moments. 
The first irreducible moment is the mean. The key property of the irreducible 
moments is that they are additive over sums of independent density 
fields. 

2.4 Shot Noise 

Typically, a galaxy survey samples only some fraction of the galaxies present in 
any volume element of the survey. To proceed, one makes the assumption that 
the galaxies surveyed are selected randomly from some continuous underlying 
population. 

Exercise 4. Convince yourself of the theorem that: The correlation
function $\xi(r)$ of a discrete random sampling of a density field is equal to
the correlation function of the original field. □

^3 Exercise 3. Show that the quantity $(2\pi)^3 \delta_D(\mathbf{k}_i + \mathbf{k}_j)$ in equation (7) is just
the unit matrix in Hilbert space. In other words, show that
$(2\pi)^3 \delta_D(\mathbf{k}_i + \mathbf{k}_j) \, a_{k_j} = a^{k_i}$ (with implicit integration over the repeated index $k_j$) for
any vector $a_k$. □
Actually, there’s a catch to the above theorem, which is that the correlation
function of the randomly subsampled field is equal to that of the parent field 
at all separations except at zero separation. If it is allowed that a particle 
is considered to be a neighbour of itself, then the correlation function of the 
randomly subsampled field acquires an extra contribution, a delta-function at 
zero separation, which is the shot noise. 

As a general definition, the shot noise is the self-particle contribution 
to any statistic. In the case of the correlation function or power spectrum, 
the shot noise is the self-pair contribution, that is, the contribution from pairs 
consisting of a particle (galaxy) and itself. 

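Exercise 5. Argue that the shot noise contribution to the correlation function
at a point where the selection function is $\bar{n}(\mathbf{r})$ is

\[ \langle \delta(\mathbf{r}_i) \delta(\mathbf{r}_j) \rangle_{\rm shot} = \frac{\delta_D(\mathbf{r}_i - \mathbf{r}_j)}{\bar{n}(\mathbf{r}_i)} . \tag{8} \]

□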


You might think that this is trickery. Can’t you just exclude self-pairs and 
disregard this shot noise nonsense? The answer is that if you go to another 
space, such as Fourier space, then the shot noise shows up in a way that is 
not so trivial to remove. 

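Exercise 6. Show from equation (8) that the shot noise contribution to the
covariance of Fourier modes is

\[ \langle \delta(\mathbf{k}_i) \delta(\mathbf{k}_j) \rangle_{\rm shot} = (1/\bar{n})(\mathbf{k}_i + \mathbf{k}_j) \tag{9} \]

where $(1/\bar{n})(\mathbf{k})$ is the Fourier transform of $1/\bar{n}(\mathbf{r})$

\[ (1/\bar{n})(\mathbf{k}) \equiv \int [1/\bar{n}(\mathbf{r})] \, e^{i \mathbf{k} \cdot \mathbf{r}} \, d^3r . \tag{10} \]

□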


According to equations (9) and (10), the shot noise contribution to the
variance of Fourier modes is $\langle \delta(\mathbf{k}) \delta(\mathbf{k})^* \rangle_{\rm shot} = (1/\bar{n})(0)$. For any finite survey,
this shot noise contribution is infinite. This simply reflects the fact that Fourier
modes are waves extending to infinity, and that it would require an infinite
survey to measure the amplitude of a wave whose wavenumber is specified
with infinite precision. Fourier modes of real finite surveys are subject to an
uncertainty principle: the wavenumbers of their Fourier modes are not precise,
but rather are smeared over some finite width $\Delta k \sim 1/R$, where $R$ is a measure
of the linear size of the survey. You will discover more about what happens
in real surveys in the exercise in §3.1.
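The flat shot noise level $1/\bar{n}$ can be seen in a small numerical experiment. The following is a sketch under simplifying assumptions that are not part of the text: a one-dimensional periodic box of uniform selection function, an unclustered random sample of points, and only the nonzero modes that fit the box. In a finite box of length `box` the factor $(2\pi)\delta_D(0)$ is replaced by `box`.

```python
import numpy as np

rng = np.random.default_rng(0)
box = 100.0                          # box length in arbitrary units (assumed)
n_gal = 2000                         # number of sampled particles
nbar = n_gal / box                   # uniform selection function
x = rng.uniform(0.0, box, n_gal)     # unclustered sample: no cosmic power

modes = np.arange(1, 1001)           # nonzero modes, k = 2 pi m / box
k = 2.0 * np.pi * modes / box
# For k != 0 the mean density does not contribute:
# delta(k) = (1 / nbar) * sum_i exp(i k x_i)
delta_k = np.exp(1j * np.outer(k, x)).sum(axis=1) / nbar
power = np.abs(delta_k) ** 2 / box   # measured power per mode
shot_estimate = power.mean()         # should scatter around 1 / nbar
```

Since the points are unclustered, the measured power is entirely the self-pair contribution, and its average over modes is close to $1/\bar{n}$ at every wavenumber.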
Question 1. Why is the 2-point function (either the correlation
function or the power spectrum) the statistic of choice in characterizing
LSS? All of the following are true, but which is the most important?

A. Because it has a simple physical meaning: the correlation function $\xi(r)$
is the average excess over random of the probability of finding a particle
(galaxy) at given separation $r$ from another particle (galaxy).

B. Because the 2-point function can be measured relatively easily from 
observations, essentially by counting pairs. 

C. Because the correlation function or power spectrum is the (co)variance of 
density (the 2nd irreducible moment), which is the lowest order irreducible 
moment after the mean (the 1st irreducible moment). 

D. Because the central limit theorem implies that a density distribution 
is asymptotically Gaussian in the limit where the density results from 
the average of many independent random processes; and a Gaussian is 
completely characterized by its mean and variance (the 1st and 2nd 
irreducible moments). 

E. Because the 2-point function satisfies a dynamical equation, a low order 
member of the BBGKY hierarchy of equations. 

Answer at end of paper. □ 

Question 2. What is the advantage of the power spectrum $P(k)$
over the correlation function $\xi(r)$? Which of the following is the most
important?

A. During the linear growth of fluctuations, the evolution of the Fourier mode
$\delta(\mathbf{k})$ at each wavevector $\mathbf{k}$ is independent of every other.

B. The covariance matrix of Fourier modes $\delta(\mathbf{k})$ is a diagonal matrix,
equation (7), whereas the covariance matrix of real space modes $\delta(\mathbf{r})$ is not a
diagonal matrix, equation (2).

C. The power spectrum is the covariance of Fourier modes; Fourier modes
are eigenmodes of the translation operator; the density distribution is
statistically translation invariant; hence the cosmic covariance matrix must
commute with the translation operator.

D. Estimates of the power $P(k)$ at different wavenumbers $k$ are uncorrelated,
for Gaussian fluctuations, whereas estimates of the correlation function
$\xi(r)$ at different separations $r$ are correlated.

E. The power spectrum is easier to measure than the correlation function. 

Answer at end of paper. □ 
3 Traditional Methods for Measuring Power

Yu & Peebles (1969) and Peebles (1973) were the first to characterize
LSS with the power spectrum. Their methodology was complicated by the
fact that they had only positions on the sky, not full 3-dimensional positions.

Baumgart & Fry (1991) [1] first pointed out that you could measure the
galaxy power spectrum $P(k)$ in a redshift survey by the simple method of
enclosing the survey in a box and Fourier transforming without having to
bother about the detailed boundaries of the survey. This astonishing
result appeared to be in stark contrast to measurements of the correlation
function $\xi(r)$, where it was essential to worry about the survey boundaries.

In an influential paper, Feldman, Kaiser & Peacock (1994) [2] proposed
a variant of the Baumgart & Fry [1] method, in which each galaxy $i$ is first
weighted by

\[ w(\mathbf{r}_i) = \frac{1}{1 + \bar{n}(\mathbf{r}_i) P(k)} \tag{11} \]

where $\bar{n}(\mathbf{r}_i)$ is the selection function at the position of galaxy $i$, before
Fourier transforming. FKP showed that this procedure provided an optimal
estimate of power $P(k)$ in the case that

(1) the wavelength $2\pi/k$ is small compared to the scale of the survey, and

(2) fluctuations are Gaussian. 

Physically, the FKP weighting (11) is an approximation to the inverse variance
weighting. It weights volumes by $1/[\bar{n}^{-1}(\mathbf{r}) + P(k)]$, which one recognizes as
the reciprocal of the sum of the shot noise $\bar{n}^{-1}(\mathbf{r})$ and the cosmic power $P(k)$.
This approximation is valid only in the “classical” limit where position $\mathbf{r}$ and
wavenumber $\mathbf{k}$ are simultaneously measurable, which is condition (1) above.
The condition (2) of Gaussianity comes from the fact that the thing being
measured is power, a 2-point statistic, and the uncertainty in power involves,
in addition to a product of 2-point terms, a 4-point term which vanishes only
for Gaussian fluctuations.
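The behaviour of the FKP weighting in its two limits can be tabulated directly. The sketch below is a minimal illustration; the function names, the assumed prior power and the densities are hypothetical values chosen only to span the regimes $\bar{n}P \gg 1$, $\bar{n}P = 1$ and $\bar{n}P \ll 1$.

```python
import numpy as np

def fkp_weight(nbar, power):
    """Per-galaxy FKP weight w = 1 / (1 + nbar * P)."""
    return 1.0 / (1.0 + np.asarray(nbar, dtype=float) * power)

def volume_weight(nbar, power):
    """Equivalent weight per unit volume, w * nbar = 1 / (1/nbar + P)."""
    nbar = np.asarray(nbar, dtype=float)
    return fkp_weight(nbar, power) * nbar

power_guess = 1.0e4                         # assumed prior value of P(k)
nbar = np.array([1.0e-2, 1.0e-4, 1.0e-6])   # hypothetical selection function values
w = fkp_weight(nbar, power_guess)           # nbar * P = 100, 1, 0.01
W = volume_weight(nbar, power_guess)
```

Where the survey is densely sampled the volume weight tends to $1/P$, so all volume elements count equally; where it is sparsely sampled each galaxy gets weight close to one and the volume weight tends to $\bar{n}$.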

Notice that the FKP weighting (11) depends on the power spectrum $P(k)$,
which is the same thing that the FKP method aims to measure. In Bayesian
analysis the $P(k)$ in the FKP weight would be recognized as being part of the
model (part of the prior). However, the FKP approach is not Bayesian, but
rather follows traditional statistical methods.

The FKP method is excellent for the intuition, and for quick, approximate 
estimates. However, it is not adequate for precision cosmology and the
estimation of cosmological parameters. The FKP method is inadequate both at
large scales where assumption (1) fails, and at small scales, where assumption 
(2) fails. The problem is not so much that the FKP weighting is suboptimal 
(though that is true), but rather that the FKP method does not yield a precise 
estimate of the variance and covariance of measured power. Moreover, it is 
not powerful enough to deal with all of the real world issues of actual galaxy 
surveys. 
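3.1 The Baumgart & Fry (1991) [1] Miracle

Exercise 7. Let $\tilde\delta(\mathbf{r})$ denote a weighted overdensity of galaxies at position
$\mathbf{r}$ in a survey:

\[ \tilde\delta(\mathbf{r}) \equiv w(\mathbf{r}) [n(\mathbf{r}) - \bar{n}(\mathbf{r})] = w(\mathbf{r}) \bar{n}(\mathbf{r}) \delta(\mathbf{r}) = W(\mathbf{r}) \delta(\mathbf{r}) . \tag{12} \]

Here $w(\mathbf{r})$ is some arbitrary weighting (such as the FKP weighting) that
you choose [with the proviso that the weighting must be chosen a priori,
independent of the observed galaxy density $n(\mathbf{r})$]; and $W(\mathbf{r}) \equiv w(\mathbf{r}) \bar{n}(\mathbf{r})$ is
the product of the weighting function and the selection function.

(a) Fourier modes of the weighted overdensity

The Fourier transform of the weighted overdensity $\tilde\delta(\mathbf{r})$ is, by definition,

\[ \tilde\delta(\mathbf{k}) \equiv \int \tilde\delta(\mathbf{r}) \, e^{i \mathbf{k} \cdot \mathbf{r}} \, d^3r = \int W(\mathbf{r}) \delta(\mathbf{r}) \, e^{i \mathbf{k} \cdot \mathbf{r}} \, d^3r . \tag{13} \]

Show that $\tilde\delta(\mathbf{k})$ equals the convolution of the Fourier transform $W(\mathbf{k})$ of
the window with the Fourier transform $\delta(\mathbf{k})$ of the overdensity:

\[ \tilde\delta(\mathbf{k}) = \int W(\mathbf{k} - \mathbf{k}') \, \delta(\mathbf{k}') \, \frac{d^3k'}{(2\pi)^3} . \tag{14} \]

This is just the standard result that multiplication in real space becomes
convolution in Fourier space.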

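(b) Covariance of Fourier modes of the weighted overdensity

Assume that the covariance $\langle \delta(\mathbf{k}_i) \delta(\mathbf{k}_j) \rangle$ of (unweighted) overdensities
in Fourier space is a sum of a cosmic term $(2\pi)^3 \delta_D(\mathbf{k}_i + \mathbf{k}_j) P(k_i)$,
equation (7), and a shot noise term $(1/\bar{n})(\mathbf{k}_i + \mathbf{k}_j)$, equation (9):

\[ \langle \delta(\mathbf{k}_i) \delta(\mathbf{k}_j) \rangle = (2\pi)^3 \delta_D(\mathbf{k}_i + \mathbf{k}_j) P(k_i) + (1/\bar{n})(\mathbf{k}_i + \mathbf{k}_j) . \tag{15} \]

Show that the covariance of Fourier modes of the weighted overdensity is

\[ \langle \tilde\delta(\mathbf{k}_i) \tilde\delta(\mathbf{k}_j) \rangle = \int W(\mathbf{k}_i - \mathbf{k}) \, W(\mathbf{k}_j + \mathbf{k}) \, P(k) \, \frac{d^3k}{(2\pi)^3} + N(\mathbf{k}_i + \mathbf{k}_j) \tag{16} \]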

where the shot noise $N(\mathbf{k})$ is^4

\[ N(\mathbf{k}) \equiv \int w(\mathbf{r})^2 \, \bar{n}(\mathbf{r}) \, e^{i \mathbf{k} \cdot \mathbf{r}} \, d^3r . \tag{18} \]

^4 Actually it is more accurate to use the actual value of the shot noise, which is

\[ N(\mathbf{k}) = \sum_{{\rm galaxies}\ i} w(\mathbf{r}_i)^2 \, e^{i \mathbf{k} \cdot \mathbf{r}_i} \tag{17} \]

as opposed to equation (18) which merely gives the expectation value of the
shot noise. Shot noise is, by definition, the contribution to the covariance from
self-pairs (pairs consisting of a particle and itself). Equation (8), from which follows
equation (18), is a statement about the average excess of neighbours of a particle.
But in fact we know more about the shot noise than just its average: we know that
each particle always has exactly one of itself as a neighbour, not merely on average.
In statistics, an estimate that uses more prior information is always better than an
estimate that uses less.


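Hence conclude that the variance of Fourier modes of the weighted
overdensity is

\[ \langle \tilde\delta(\mathbf{k}) \tilde\delta(\mathbf{k})^* \rangle = \int |W(\mathbf{k} - \mathbf{k}')|^2 \, P(k') \, \frac{d^3k'}{(2\pi)^3} + N(0) . \tag{19} \]

Equation (19) says that the variance of Fourier modes of weighted
overdensity is, after subtraction of the shot noise $N(0)$, equal to the power
spectrum $P(k)$ smoothed over a smoothing function given by the magnitude
squared $|W(\mathbf{k})|^2$ of the Fourier transform $W(\mathbf{k})$ of the window.
Let $W^2$ denote the integral over the window (a notation suggested by the
fact that $W^2$ is the scalar product of the Hilbert-space vector $W_k$ with
itself)

\[ W^2 \equiv \int |W(\mathbf{k})|^2 \, \frac{d^3k}{(2\pi)^3} = \int W(\mathbf{r})^2 \, d^3r . \tag{20} \]

Then a smoothed estimate $\tilde{P}(\mathbf{k})$ of power at wavevector $\mathbf{k}$ is

\[ \tilde{P}(\mathbf{k}) \equiv \frac{\int |W(\mathbf{k} - \mathbf{k}')|^2 \, P(k') \, d^3k'/(2\pi)^3}{W^2} = \frac{\langle \tilde\delta(\mathbf{k}) \tilde\delta(\mathbf{k})^* \rangle - N(0)}{W^2} . \tag{21} \]

Equation (21) is essentially Baumgart & Fry’s (1991) [1] remarkable result.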

(c) What does it mean?

Suppose that the survey window $W(\mathbf{r})$ has a characteristic size $R$.
Approximately what is the width of the smoothing window $|W(\mathbf{k})|^2$ in the
smoothed power spectrum, equation (21)? At what wavenumber $k$ would
you cease to trust the smoothed power spectrum $\tilde{P}(\mathbf{k})$ as a reasonable
estimate of the true power $P(k)$? What happens as the size $R$ of the
survey gets larger? [The important thing here is the concept rather than
the mathematics. But if you want to see how this plays out mathematically,
you might consider a survey window $W(\mathbf{r})$ which happens to be a Gaussian
$W(\mathbf{r}) = \exp[-r^2/(2R^2)]$, centred at the origin, with a 1-$\sigma$ width of $R$.]


□ 
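The Gaussian window suggested in part (c) can be explored numerically before attempting the algebra. The sketch below is a one-dimensional stand-in for the three-dimensional problem, with a hypothetical helper `window_width` and an arbitrary box size and grid; it measures the rms width in wavenumber of the smoothing window $|W(k)|^2$ for a few survey sizes $R$.

```python
import numpy as np

def window_width(R, box=400.0, n=8192):
    """rms width in k of |W(k)|^2 for a 1-D Gaussian window of 1-sigma width R."""
    x = (np.arange(n) - n // 2) * (box / n)          # positions centred on the origin
    W_r = np.exp(-x**2 / (2.0 * R**2))               # survey window W(r)
    W_k = np.fft.fft(W_r)                            # its Fourier transform (up to a phase)
    k = 2.0 * np.pi * np.fft.fftfreq(n, d=box / n)   # wavenumbers of the FFT modes
    weight = np.abs(W_k) ** 2                        # smoothing window |W(k)|^2
    return np.sqrt(np.sum(k**2 * weight) / np.sum(weight))

# Width of the smoothing window for three survey sizes.
widths = {R: window_width(R) for R in (5.0, 10.0, 20.0)}
```

Doubling the survey size halves the width of the smoothing window, so a larger survey resolves the power spectrum at smaller wavenumbers.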


3.2 The Feldman-Kaiser-Peacock (1994) [2] Method

In the previous exercise you obtained an estimate, equation (21), of (smoothed)
power $\tilde{P}(\mathbf{k})$ at wavevector $\mathbf{k}$. The estimate involved an arbitrarily adjustable
weighting $W(\mathbf{r})$, equation (12), of volume elements in the survey. It is natural
to try to choose this weighting $W(\mathbf{r})$ to try to minimize the variance of the
resulting estimate (21) of power. The FKP weighting, already given as
equation (11), is an approximation to the desired minimum variance weighting,
valid under the two conditions stated immediately after equation (11).
It proves surprisingly tricky to derive the FKP weighting in a rigorous
way with a minimum of unnecessary assumptions. It would be nice to take
you through the derivation in an exercise, but I could not devise an approach
that was satisfactorily clean, insightful, and brief. You might like to consult
the original Feldman, Kaiser & Peacock (1994) [2] paper to see how they did
it. A more general derivation can be found in Hamilton (1997). Perhaps
the most elegant approach is to use the quadratic method of Tegmark (1997),
discussed in §12 of Paper 2.

A core part of the FKP argument is the following. If the survey has
characteristic linear size $\sim R$, then the Fourier transform $W(\mathbf{k})$ of the survey
window will be a ball of width $\Delta k \sim 1/R$ around the origin $\mathbf{k} = 0$. It
follows that at wavenumbers much larger than the reciprocal size of the
survey, $k \gg 1/R$, the smoothing window $|W(\mathbf{k} - \mathbf{k}')|^2$ in the Baumgart & Fry
estimate (21) will be narrowly concentrated around the target wavenumber
$\mathbf{k}$. To the extent that the power spectrum $P(k')$ is slowly varying over the
narrow window, it can be approximated by a constant,

\[ P(k') \approx P(k) = {\rm constant} . \tag{22} \]


If the power spectrum is interpreted as literally constant, then the covariance
matrix of overdensities is diagonal in real space

\[ \langle \delta(\mathbf{r}_i) \delta(\mathbf{r}_j) \rangle = \delta_D(\mathbf{r}_i - \mathbf{r}_j) \left[ P(k) + \frac{1}{\bar{n}(\mathbf{r}_i)} \right] . \tag{23} \]

This indicates that each volume element of the survey can be approximated
as being statistically uncorrelated with all other volume elements. If you buy
into the notion that minimum variance weighting is inverse variance weighting
(§4.2 explains where that notion comes from), then equation (23) suggests that
each volume element should be weighted by

\[ W(\mathbf{r}) = \frac{1}{P(k) + 1/\bar{n}(\mathbf{r})} . \tag{24} \]

In the present case, the thing of interest is not single volume elements, but
rather pairs of volume elements. For the specific case of Gaussian fluctuations,
the covariance of pairs is a product of covariances of singles (e.g. Hamilton
1997 §2.1)

\[ \langle (\delta_i \delta_j - \langle \delta_i \delta_j \rangle) (\delta_k \delta_l - \langle \delta_k \delta_l \rangle) \rangle = \langle \delta_i \delta_k \rangle \langle \delta_j \delta_l \rangle + \langle \delta_i \delta_l \rangle \langle \delta_j \delta_k \rangle \tag{25} \]

which is true in both real space and Fourier space (it is a covariant expression
in Hilbert space). It follows that, for Gaussian fluctuations, the inverse
variance weighting of pairs of volume elements is

\[ W(\mathbf{r}_i) W(\mathbf{r}_j) = \frac{1}{[P(k) + 1/\bar{n}(\mathbf{r}_i)] \, [P(k) + 1/\bar{n}(\mathbf{r}_j)]} . \tag{26} \]

Equation (26) is the FKP pair weighting.
4 Bayes, Fisher, and Maximum Likelihood

Bayesian statistics provides the modern mathematical framework for rigorous
statistics. As explained below, §4.3, it gives special status to maximum
likelihood as yielding the best estimate of a parameter or set of parameters.

Fisher, Scharf & Lahav (1994) were the first to apply a likelihood
approach to large scale structure. Heavens & Taylor (1995) may be credited
with accomplishing the first likelihood analysis designed to retain as much
information as possible at linear scales. With the work of Heavens & Taylor,
maximum likelihood methods appeared on the LSS scene essentially fully
fledged.


4.1 Bayesian Statistics 


Traditional statistics. Measure the mean by measuring the mean; measure 
the variance by measuring the variance; and such-like naivety. 

Bayesian statistics. Measure the mean (or variance) by asking, what is 
the probability that the mean (or variance) takes such and such a value, given 
this set of observations and this set of assumptions (the prior). Commonly, 
the prior is subdivided into (a) assumptions that you assert are true, and (b) 
a model equipped with parameters whose values you wish to estimate. 

The foundation of Bayesian statistics is Bayes’ theorem.

Bayes’ Theorem. States that the posterior probability $P(p|x,X)$
that the parameters $p$ take on certain values, given the observational data
$x$ and prior assumptions $X$, is proportional to the likelihood function,
the probability $P(x|p,X)$ of the observations $x$ given parameters $p$ and
prior assumptions $X$, multiplied by the prior probability $P(p|X)$ of the
parameters $p$ given the prior assumptions $X$

\[ P(p|x,X) = \frac{P(x|p,X) \, P(p|X)}{P(x|X)} . \tag{27} \]


□ 

The probability P{x\X) of the observations given the prior assumptions is an 
overall normalization constant which plays no role except to ensure that the 
integral over the posterior probability is 1. 
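Bayes’ theorem can be applied quite literally on a grid of parameter values. The sketch below is a toy illustration under assumptions of my own choosing: the data are independent Gaussian measurements with known scatter `sigma` (part of the prior assumptions $X$), the single parameter $p$ is their mean, and the prior is flat over the grid. All names and numbers are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(1)
sigma = 2.0                                        # assumed known noise level
data = rng.normal(loc=3.0, scale=sigma, size=50)   # hypothetical observations x

p_grid = np.linspace(-5.0, 10.0, 3001)             # candidate values of the parameter p
dp = p_grid[1] - p_grid[0]

# Log-likelihood ln P(x | p, X) for independent Gaussian errors.
log_like = -0.5 * ((data[None, :] - p_grid[:, None]) ** 2).sum(axis=1) / sigma**2
prior = np.ones_like(p_grid)                       # flat prior P(p | X) on the grid

# Subtract the maximum before exponentiating to avoid underflow.
unnormalised = np.exp(log_like - log_like.max()) * prior
evidence = unnormalised.sum() * dp                 # plays the role of P(x | X)
posterior = unnormalised / evidence

p_best = p_grid[np.argmax(posterior)]              # posterior maximum on the grid
```

The normalization only rescales the curve; with a flat prior the posterior peaks where the likelihood does, which here is the sample mean of the data.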

The likelihood function $P(x|p,X)$, which converts by multiplication a
prior probability into a posterior probability, encapsulates all the information
provided by a set of observations. In recognition of this fundamental role, the
likelihood is given its own letter, $\mathcal{L}$. R. A. Fisher, who developed much of the
formalism of likelihoods in the first half of the 20th century, never subscribed
to Bayesian statistics - after all, as just remarked, the likelihood encapsulates
all the information provided by a set of observations. Nevertheless, if you want
to convert a likelihood into a (posterior) probability that the parameters take
on a certain range of values, then you have to assume a prior probability
distribution of parameters.

“If the prior matters, then you are not learning much from the data” -
from an Aspen Center for Physics workshop in summer 1997.


Question 3. Rank each of the following prior assumptions in order of 
probability of being true: 


A. The Universe is statistically homogeneous and isotropic. 

B. The growth of fluctuations is driven primarily by gravity. 

C. Fluctuations at linear scales are Gaussian. 

D. The Universe is spatially flat. 

E. The $\Lambda$CDM model, with $\Omega_\Lambda \approx 0.7$, $\Omega_c \approx 0.26$, and $\Omega_b \approx 0.04$, is correct
($\Lambda$ is the cosmological constant, c is Cold Dark Matter, and b is baryons).

F. Galaxy bias $b(k)$, defined to be the square root of the ratio of galaxy-galaxy
power $P_{gg}(k)$ to matter-matter power $P_{mm}(k)$,

\[ b(k)^2 \equiv \frac{P_{gg}(k)}{P_{mm}(k)} , \tag{28} \]


is constant at linear scales. 


Answer at end of paper. □ 


4.2 Fisher Information Matrix 


The Fisher information matrix (Fisher 1935) plays a fundamental role
in Bayesian statistical analysis. Many of us in the field of large scale structure
learned about Fisher matrices from the superb paper by Tegmark, Taylor &
Heavens (1996), and I can offer little better advice than to go read that
paper!

The term “optimal”, applied to some statistical estimate of a quantity, has 
acquired a bad reputation thanks to misuse. The Fisher matrix puts what is 
meant by “optimal” on a sound mathematical basis. It is well worth getting 
your brain around the Fisher matrix, because it will raise your understanding 
of statistics to a new level. 

The Fisher information matrix $F_{\alpha\beta}$ of a set of parameters $p_\alpha$ is formally
defined to be minus the expectation value of the second derivative of the
log-likelihood function with respect to the parameters:

\[ F_{\alpha\beta} \equiv - \left\langle \frac{\partial^2 \ln \mathcal{L}}{\partial p_\alpha \, \partial p_\beta} \right\rangle . \tag{29} \]

Expectation value here means averaged over an ensemble of observational data
$x$ predicted by the likelihood function

\[ \langle t \rangle = \int t \, \mathcal{L}(x|p) \, dx . \tag{30} \]

Since the likelihood $\mathcal{L}$ is multiplicative over statistically independent sets of
observations, it follows that the Fisher matrix, equation (29), is additive over
statistically independent observations, a sensible property for information to
have.
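The additivity of Fisher information is easy to verify in the simplest case. The sketch below assumes independent Gaussian data with known scatter `sigma` and a single unknown parameter, the mean `mu`; these choices and the function names are illustrative only. For this likelihood the second derivative of the log-likelihood does not depend on the data, so the expectation value in equation (29) is trivial and the Fisher information is the number of data points divided by `sigma**2`.

```python
import numpy as np

def log_like(mu, data, sigma):
    """ln L (up to a constant) for Gaussian data with unknown mean mu, known sigma."""
    return -0.5 * np.sum((data - mu) ** 2) / sigma**2

def fisher_mu(data, sigma, mu=0.0, h=1e-2):
    """Minus the second derivative of ln L with respect to mu, by finite differences."""
    d2 = (log_like(mu + h, data, sigma) - 2.0 * log_like(mu, data, sigma)
          + log_like(mu - h, data, sigma)) / h**2
    return -d2

rng = np.random.default_rng(2)
sigma = 1.5
set_a = rng.normal(0.0, sigma, 40)       # two statistically independent data sets
set_b = rng.normal(0.0, sigma, 60)

F_a = fisher_mu(set_a, sigma)
F_b = fisher_mu(set_b, sigma)
F_ab = fisher_mu(np.concatenate([set_a, set_b]), sigma)   # joint analysis
```

The information from the joint data set equals the sum of the information from each set separately, because the joint log-likelihood is the sum of the two log-likelihoods.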

The power of the Fisher matrix derives from the Cramer-Rao inequality
(Kendall & Stuart 1967 [7] §17.15), which states that the variance of
any unbiassed (see equation (32) below) estimate $\hat{p}_\alpha$ of a parameter $p_\alpha$ must
exceed the reciprocal of the diagonal element of the Fisher matrix:

\[ \langle \Delta \hat{p}_\alpha^2 \rangle \geq \frac{1}{F_{\alpha\alpha}} \quad \mbox{(no summation over $\alpha$)} . \tag{31} \]

You will derive the Cramer-Rao inequality in Exercise 9 below.

To the extent that the likelihood function $\mathcal{L}$ is a Gaussian about its
maximum (this is distinct from the proposition that the likelihood function
is Gaussian in the data), often a good approximation thanks to the central
limit theorem, the Fisher matrix is approximately equal to the inverse of the
covariance matrix of the parameters. You are probably familiar from your
earliest statistical training with the notion that the “best” way to weight a
set of data is by their inverse variance (an idea already encountered in §3.2
on the FKP weighting). Inverse variance weighting effectively weights data in
proportion to the amount of information in each part.

4.3 Maximum Likelihood 

An estimator $\hat{p}$ of a parameter $p$ (the hat on the estimator $\hat{p}$ distinguishes
it from the true value $p$) is some function $\hat{p}(x)$ of the observational data
$x$. An estimate $\hat{p}$ of a parameter $p$ is unbiassed if the expectation value,
equation (30), of the estimate equals the true value

\[ \langle \hat{p} \rangle = p . \tag{32} \]

A theorem of fundamental importance (Kendall & Stuart 1967 [7] §18.5)
states that if an unbiassed estimator attaining the Cramer-Rao bound exists,
then it is the maximum likelihood estimator, the values $\hat{p}_\alpha$ of the
parameters for which the likelihood attains its maximum value given the observed
data:

\[ \frac{\partial \ln \mathcal{L}}{\partial p_\alpha} = 0 . \tag{33} \]


It is this theorem that gives the maximum likelihood method its special status. 

Yet more theorems give conditions (see Kendall & Stuart 1967 [7], and
Exercise 9(d) below) under which an unbiassed estimator attaining the
Cramer-Rao bound exists. For example, such an estimator exists if the
likelihood function $\mathcal{L}$ is a Gaussian about its maximum. The central limit
theorem ensures that $\mathcal{L}$ is asymptotically Gaussian in the limit of a large
amount of data. Thus an unbiassed estimator attaining the Cramer-Rao bound
exists in the asymptotic limit of a large amount of data.
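A small Monte Carlo experiment shows a maximum likelihood estimator that is unbiassed and attains the Cramer-Rao bound. The setting is an assumption chosen for simplicity: Gaussian data with known scatter `sigma` and unknown mean, for which setting the derivative of the log-likelihood to zero gives the sample mean as the maximum likelihood estimate, and the Fisher information is the number of data points divided by `sigma**2`. The numbers are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(3)
mu_true, sigma, n_data, n_trials = 1.0, 2.0, 25, 20000

# Each row is one hypothetical data set x drawn from the likelihood.
samples = rng.normal(mu_true, sigma, size=(n_trials, n_data))

# Maximum likelihood estimate of the mean for each data set:
# d ln L / d mu = sum(x - mu) / sigma^2 = 0  gives  mu_hat = mean(x).
mu_hat = samples.mean(axis=1)

bias = mu_hat.mean() - mu_true            # should be consistent with zero
variance = mu_hat.var()                   # scatter of the estimator over the ensemble
cramer_rao_bound = sigma**2 / n_data      # 1 / F, with F = n_data / sigma^2
```

Over the ensemble the estimator scatters about the true value with a variance equal, within sampling error, to the reciprocal of the Fisher information.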






Andrew J S Hamilton 


Exercise 8. The Schwarz inequality. 

The basis of the Cramer-Rao inequality is the Schwarz inequality, which 
states, equation (Idhil below, that the correlation coefficient between two 
estimates must be less than or equal to one in absolute value. The Schwarz 
inequality is a powerful general result in statistics, and it is well worth knowing 
how to derive it. 


(a) Consider two estimators p and q, with (co)variances (Ap^), (ApAq), and 
(Aq^). It is evidently true that 


{{Ap + XAqf) > 0 


(34) 


for any real number A. For what value of A, in terms of the (co)variances, 
is the left hand side of equation m a minimum? 

(b) Hence derive the Schwarz inequality 


{ApAq) 

(Ap2)i/2^Ag2}i/2 


(35) 


The quantity inside the vertical bars on the left hand side of equation 13511 
is called the correlation coefficient of the estimators p and q. The 
Schwarz inequality states that correlation coefficient must lie in the 
interval [—1,1]. A correlation coefficient of 1 means that the estimators 
are perfectly correlated; a correlation coefficient of —1 means that the 
estimators are perfectly anti-correlated. 

(c) What relation between $\hat{p}$ and $\hat{q}$ must be satisfied for the Schwarz inequality
to become an equality? [Answer: $\Delta\hat{p}$ must be proportional to $\Delta\hat{q}$.]


□ 
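
A quick numerical check of equations (34) and (35) can build intuition before attempting the derivation. The Python sketch below is only an illustration under assumed inputs: the two samples, their degree of correlation, the seed, and the helper name `covariance` are all invented for the purpose.

```python
import random

def covariance(a, b):
    """Sample covariance of a and b about their sample means."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    return sum((x - ma) * (y - mb) for x, y in zip(a, b)) / n

random.seed(1)   # fixed seed so the run is repeatable
p = [random.gauss(0.0, 1.0) for _ in range(1000)]
q = [0.6 * x + random.gauss(0.0, 0.8) for x in p]   # partly correlated with p

var_p, var_q, cov_pq = covariance(p, p), covariance(q, q), covariance(p, q)

def lhs(lam):
    """Left hand side of equation (34), expanded in the (co)variances."""
    return var_p + 2.0 * lam * cov_pq + lam * lam * var_q

# Equation (34): the expression stays non-negative over a grid of lambda values.
min_lhs = min(lhs(-5.0 + 0.01 * i) for i in range(1001))

# Equation (35): the correlation coefficient lies in [-1, 1].
r = cov_pq / (var_p ** 0.5 * var_q ** 0.5)
print(min_lhs, r)
```

The grid scan only confirms the inequality for the values tried; the exercise asks for the exact minimising value and the general proof.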


Question 4. Which of the following is true according to the Schwarz
inequality? In the following, the subscripts g and m denote galaxies and
matter, so that, for example, $\xi_{gg}(r)$ and $\xi_{mm}(r)$ are the galaxy-galaxy and
matter-matter correlation functions, while $\xi_{gm}(r)$ is the galaxy-matter cross
correlation.

A. $\left| \xi_{gm}(r) / [\xi_{gg}(r) \xi_{mm}(r)]^{1/2} \right| \le 1$.

B. $\left| P_{gm}(k) / [P_{gg}(k) P_{mm}(k)]^{1/2} \right| \le 1$, where $P$ is the shot-noise-subtracted
power spectrum.

C. $\left| P_{gm}(k) / [P_{gg}(k) P_{mm}(k)]^{1/2} \right| \le 1$, where $P$ is the power spectrum before
shot noise is subtracted.

D. All of the above.

E. None of the above.

Answer at end of paper. □ 
Exercise 9. The Cramer-Rao inequality.

In this exercise you will derive the Cramer-Rao inequality stated earlier.
The derivation follows Kendall & Stuart (1967) [7] §17.15, which you should 
consult for more detail and generality. For simplicity, the exercise considers 
just a single parameter p. 

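(a) The likelihood function $\mathcal{L}(x|p)$ is the probability of the data $x$ given the
parameter $p$, and so satisfies the normalization condition

$$\int \mathcal{L}(x|p) \, dx = 1 \tag{36}$$

for any value of the parameter $p$. Differentiate this with respect to the
parameter $p$ to obtain $\int (\partial \mathcal{L} / \partial p) \, dx = 0$, or equivalently

$$\left\langle \frac{\partial \ln \mathcal{L}}{\partial p} \right\rangle = 0 . \tag{37}$$

Differentiate again with respect to $p$ to obtain

$$- \left\langle \frac{\partial^2 \ln \mathcal{L}}{\partial p^2} \right\rangle = \left\langle \left( \frac{\partial \ln \mathcal{L}}{\partial p} \right)^2 \right\rangle . \tag{38}$$

You recognise the left hand side of (38) as the Fisher information in the
parameter $p$, and you see from equation (38) that this
information must be positive.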

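(b) Consider an unbiassed estimator $\hat{p}$. Being unbiassed, the
estimator must satisfy

$$\langle \hat{p} - p \rangle = \int (\hat{p} - p) \, \mathcal{L} \, dx = 0 . \tag{39}$$

Differentiate this with respect to $p$ to show that

$$\left\langle (\hat{p} - p) \frac{\partial \ln \mathcal{L}}{\partial p} \right\rangle = 1 . \tag{40}$$

(c) Apply the Schwarz inequality to deduce from equation (40), coupled with
equation (38), the Cramer-Rao inequality

$$\langle (\hat{p} - p)^2 \rangle \ge \frac{1}{- \left\langle \partial^2 \ln \mathcal{L} / \partial p^2 \right\rangle} . \tag{41}$$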

(d) In Exercise 8(c) you obtained a condition for the Schwarz inequality to
become an equality. What is the condition on the likelihood function
$\mathcal{L}$ for the Cramer-Rao bound to be attained? [Before you rush to the
conclusion that $\mathcal{L}$ must be Gaussian, consider (and prove) the fact that
the Cramer-Rao bound is also attained by a Poisson distribution, for
which the likelihood of observing $x$ counts over an interval during which
the expected number of counts is $p$ is $p^x e^{-p} / x!$.]


□ 
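
The Poisson case mentioned in part (d) can be checked numerically, although a check is of course no substitute for the proof the exercise asks for. The Python sketch below assumes a single observed count $x$ is used directly as the estimate of $p$, and sums over $x$ to compare the variance of that estimate with the inverse of the Fisher information. The function name, the truncation `x_max`, and the test value of $p$ are arbitrary choices made for illustration.

```python
import math

def poisson_moments(p, x_max=400):
    """Return (variance of the estimate x, Fisher information) for the Poisson
    likelihood p^x e^{-p} / x!, by direct summation over x = 0..x_max."""
    pmf = math.exp(-p)   # probability of x = 0
    variance = 0.0
    fisher = 0.0
    for x in range(x_max + 1):
        variance += pmf * (x - p) ** 2       # <(x - p)^2>, x being the estimate of p
        fisher += pmf * (x / p - 1.0) ** 2   # <(d ln L / dp)^2>
        pmf *= p / (x + 1)                   # recurrence to the next value of x
    return variance, fisher

variance, fisher = poisson_moments(3.5)
print(variance * fisher)   # a product of 1 means the Cramer-Rao bound is attained
```

The sum is truncated at `x_max`, which is harmless as long as `x_max` is much larger than $p$.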
Answers to multiple choice questions

Question 1. The correct answer is D. Whereas the other answers have to
do with humanistics (A & B) or mathematics (C & E), D has to do with 
physics. If density fluctuations in the universe were originally generated as 
a superposition of many independent random processes, then the resulting 
primordial density field will be Gaussian. This is a generic, albeit not universal, 
prediction of inflation, where density fluctuations are seeded by quantum 
fluctuations of the inflaton field. The prediction of Gaussianity remains 
consistent with observations (Komatsu et al 2003).

Question 2. All of these are true except perhaps for E, the truth of which
depends on methodology. For example, the Baumgart & Fry (1991) method
described earlier is about as easy as could be, but it is far from the best method.
Answers A-D are all important. However, answer C is the most insightful, 
because it gives the fundamental reason - statistical homogeneity - for the 
power spectrum’s superiority over the correlation function. Answers A, B, 
& D can all be construed as consequences of the fundamental assumption of 
statistical homogeneity. It is for essentially the same reason that CMB folk use 
the spherical harmonic power spectrum $C_\ell$ rather than the angular correlation
function to characterise fluctuations in the CMB. The power spectrum $C_\ell$ is
the covariance of spherical harmonics; spherical harmonics are eigenmodes of 
the rotation operator; CMB fluctuations are statistically rotation invariant; 
hence the covariance matrix of CMB fluctuations must commute with the 
rotation operator. All else (the spherical harmonic analogue of answers A, B, 
& D) follows. 

Question 3. This question generated some debate at the workshop. My own
ordering was A-F in the same order as written, but many respectable people 
opined that B should come before A. I might even agree with them. 

Question 4. The correct answer is C. Answer A is not true, and B is true
only in the limit of vanishing shot noise. Only the non-shot-noise-subtracted
power spectra can be expressed in the form of the Schwarz inequality, as

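$$\left| \frac{\langle \delta_g(\mathbf{k}) \delta_m(-\mathbf{k}) \rangle}{\langle \delta_g(\mathbf{k}) \delta_g(-\mathbf{k}) \rangle^{1/2} \langle \delta_m(\mathbf{k}) \delta_m(-\mathbf{k}) \rangle^{1/2}} \right| \le 1 . \tag{42}$$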

If you are concerned about the mixture of $\mathbf{k}$ and $-\mathbf{k}$ in equation (42), then
split $\delta(\mathbf{k})$ into its real and imaginary parts (which are uncorrelated, with equal
variances), and consider an estimator which is a sum of the real and imaginary
parts. If you are concerned with the appearance of the vector wavevector
$\mathbf{k}$ rather than its absolute value $k$, then consider an estimator that is an
arbitrarily weighted sum of modes $\delta(\mathbf{k})$ having the same wavenumber $k$.
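
The role of shot noise in this answer can be illustrated with a toy calculation. The Python sketch below is a rough illustration under loudly stated assumptions: the matter modes are complex Gaussians of unit power, the galaxies trace the matter exactly, and the shot noise is modelled as an independent Gaussian of known variance. None of these choices, nor the variable names, come from the text. The ratio built from the raw power spectra is bounded by the Schwarz inequality, whereas the ratio built from shot-noise-subtracted power is not protected and can stray above 1.

```python
import random

random.seed(2)   # fixed seed so the run is repeatable
n_modes = 500    # number of Fourier modes averaged (assumed)
power = 1.0      # assumed true matter power
shot = 0.5       # assumed galaxy shot noise level

def gaussian_mode(variance):
    """Complex Gaussian amplitude z with <|z|^2> = variance."""
    s = (0.5 * variance) ** 0.5
    return complex(random.gauss(0.0, s), random.gauss(0.0, s))

def mean(values):
    return sum(values) / len(values)

delta_m = [gaussian_mode(power) for _ in range(n_modes)]
# Toy galaxies: trace the matter exactly, plus independent shot noise.
delta_g = [dm + gaussian_mode(shot) for dm in delta_m]

p_mm = mean([abs(z) ** 2 for z in delta_m])
p_gg_raw = mean([abs(z) ** 2 for z in delta_g])
p_gm = mean([(g * m.conjugate()).real for g, m in zip(delta_g, delta_m)])

r_raw = p_gm / (p_gg_raw * p_mm) ** 0.5            # bounded by Schwarz
r_sub = p_gm / ((p_gg_raw - shot) * p_mm) ** 0.5   # no such guarantee
print(r_raw, r_sub)
```

Subtracting the shot noise shrinks the denominator, so in this toy setup the subtracted ratio is never smaller than the raw one.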





